Customer churn occurs when customers or subscribers stop doing business with a firm or service. Individualized customer retention is difficult because most firms have a large number of customers and cannot afford to devote much time to each of them; the costs would outweigh the additional revenue. However, if a company could forecast which customers are likely to leave, it could focus its retention efforts on these "high risk" clients only. The ultimate goal is to widen the company's reach and earn greater customer loyalty, and the key to succeeding in this market lies with the customers themselves. Customer churn is a critical metric because it is much less expensive to retain existing customers than to acquire new ones. To reduce churn, we need to predict which customers are at high risk of leaving.
from IPython.core.display import HTML
HTML("""
<style>
.output_png {
display: table-cell;
text-align: center;
vertical-align: middle;
horizontal-align: middle;
}
h1,h2 {
text-align: center;
background-color: black;
padding: 20px;
margin: 0;
color: yellow;
font-family: Arial;
border-radius: 80px
}
h3 {
text-align: center;
border-style: solid;
border-width: 3px;
padding: 12px;
margin: 0;
color: black;
font-family: Arial;
border-radius: 80px;
border-color: gold;
}
body, p {
font-family: Arial;
font-size: 15px;
color: charcoal;
}
div {
font-size: 14px;
margin: 0;
}
h4 {
padding: 0px;
margin: 0;
font-family: Arial;
color: purple;
}
</style>
""")
Since segmentation and lifetime value prediction have already told us who our best customers are, we should also work hard to retain them. That is what makes Retention Rate one of the most critical metrics.
Retention Rate indicates how good your product-market fit (PMF) is. If your PMF is not satisfactory, you should see your customers churning very soon. One of the most powerful tools for improving Retention Rate (and hence PMF) is Churn Prediction. With this technique, you can identify who is likely to churn in a given period.
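To make the relationship between churn and retention concrete, here is a minimal sketch on a handful of made-up flags (not the dataset used below). It assumes the simplest definition, where the retention rate over a period is one minus the churn rate over the same period; the function names are illustrative only.

```python
# Toy example: 1 means the customer churned during the period, 0 means retained.
# These flags are invented for illustration and do not come from Customer_Churn.csv.
churn_flags = [0, 1, 0, 0, 1, 0, 0, 0]


def churn_rate(flags):
    """Share of customers who churned during the period."""
    return sum(flags) / len(flags)


def retention_rate(flags):
    """Share of customers who stayed: the complement of the churn rate."""
    return 1 - churn_rate(flags)


print(f"churn rate: {churn_rate(churn_flags):.2f}")
print(f"retention rate: {retention_rate(churn_flags):.2f}")
```

Encoding churn as 0/1 is what lets a plain mean be read as a churn rate, which is the trick the group-by cells below rely on.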
from datetime import datetime, timedelta,date
import pandas as pd
%matplotlib inline
from sklearn.metrics import classification_report,confusion_matrix
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
from sklearn.cluster import KMeans
from chart_studio import plotly as py
import plotly.offline as pyoff
import plotly.graph_objs as go
import xgboost as xgb
from sklearn.model_selection import KFold, cross_val_score, train_test_split
pyoff.init_notebook_mode()
df_data = pd.read_csv('Customer_Churn.csv')
df_data.head(10)
| | customerID | gender | SeniorCitizen | Partner | Dependents | tenure | PhoneService | MultipleLines | InternetService | OnlineSecurity | ... | DeviceProtection | TechSupport | StreamingTV | StreamingMovies | Contract | PaperlessBilling | PaymentMethod | MonthlyCharges | TotalCharges | Churn |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7590-VHVEG | Female | 0 | Yes | No | 1 | No | No phone service | DSL | No | ... | No | No | No | No | Month-to-month | Yes | Electronic check | 29.85 | 29.85 | No |
| 1 | 5575-GNVDE | Male | 0 | No | No | 34 | Yes | No | DSL | Yes | ... | Yes | No | No | No | One year | No | Mailed check | 56.95 | 1889.5 | No |
| 2 | 3668-QPYBK | Male | 0 | No | No | 2 | Yes | No | DSL | Yes | ... | No | No | No | No | Month-to-month | Yes | Mailed check | 53.85 | 108.15 | Yes |
| 3 | 7795-CFOCW | Male | 0 | No | No | 45 | No | No phone service | DSL | Yes | ... | Yes | Yes | No | No | One year | No | Bank transfer (automatic) | 42.30 | 1840.75 | No |
| 4 | 9237-HQITU | Female | 0 | No | No | 2 | Yes | No | Fiber optic | No | ... | No | No | No | No | Month-to-month | Yes | Electronic check | 70.70 | 151.65 | Yes |
| 5 | 9305-CDSKC | Female | 0 | No | No | 8 | Yes | Yes | Fiber optic | No | ... | Yes | No | Yes | Yes | Month-to-month | Yes | Electronic check | 99.65 | 820.5 | Yes |
| 6 | 1452-KIOVK | Male | 0 | No | Yes | 22 | Yes | Yes | Fiber optic | No | ... | No | No | Yes | No | Month-to-month | Yes | Credit card (automatic) | 89.10 | 1949.4 | No |
| 7 | 6713-OKOMC | Female | 0 | No | No | 10 | No | No phone service | DSL | Yes | ... | No | No | No | No | Month-to-month | No | Mailed check | 29.75 | 301.9 | No |
| 8 | 7892-POOKP | Female | 0 | Yes | No | 28 | Yes | Yes | Fiber optic | No | ... | Yes | Yes | Yes | Yes | Month-to-month | Yes | Electronic check | 104.80 | 3046.05 | Yes |
| 9 | 6388-TABGU | Male | 0 | No | Yes | 62 | Yes | No | DSL | Yes | ... | No | No | No | No | One year | No | Bank transfer (automatic) | 56.15 | 3487.95 | No |
10 rows × 21 columns
df_data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7043 entries, 0 to 7042
Data columns (total 21 columns):
 #   Column            Non-Null Count  Dtype
---  ------            --------------  -----
 0   customerID        7043 non-null   object
 1   gender            7043 non-null   object
 2   SeniorCitizen     7043 non-null   int64
 3   Partner           7043 non-null   object
 4   Dependents        7043 non-null   object
 5   tenure            7043 non-null   int64
 6   PhoneService      7043 non-null   object
 7   MultipleLines     7043 non-null   object
 8   InternetService   7043 non-null   object
 9   OnlineSecurity    7043 non-null   object
 10  OnlineBackup      7043 non-null   object
 11  DeviceProtection  7043 non-null   object
 12  TechSupport       7043 non-null   object
 13  StreamingTV       7043 non-null   object
 14  StreamingMovies   7043 non-null   object
 15  Contract          7043 non-null   object
 16  PaperlessBilling  7043 non-null   object
 17  PaymentMethod     7043 non-null   object
 18  MonthlyCharges    7043 non-null   float64
 19  TotalCharges      7043 non-null   object
 20  Churn             7043 non-null   object
dtypes: float64(1), int64(2), object(18)
memory usage: 1.1+ MB
# Encode the target as integers so that Churn.mean() gives the churn rate
df_data['Churn'] = df_data['Churn'].map({'No': 0, 'Yes': 1})
df_data.groupby('gender').Churn.mean()
gender
Female    0.269209
Male      0.261603
Name: Churn, dtype: float64
df_plot = df_data.groupby('gender').Churn.mean().reset_index()
plot_data = [
go.Bar(
x=df_plot['gender'],
y=df_plot['Churn'],
width = [0.5, 0.5],
marker=dict(
color=['green', 'blue'])
)
]
plot_layout = go.Layout(
xaxis={"type": "category"},
yaxis={"title": "Churn Rate"},
title='Gender',
plot_bgcolor = 'rgb(243,243,243)',
paper_bgcolor = 'rgb(243,243,243)',
)
fig = go.Figure(data=plot_data, layout=plot_layout)
pyoff.iplot(fig)
df_plot = df_data.groupby('Partner').Churn.mean().reset_index()
plot_data = [
go.Bar(
x=df_plot['Partner'],
y=df_plot['Churn'],
width = [0.5, 0.5],
marker=dict(
color=['green', 'blue'])
)
]
plot_layout = go.Layout(
xaxis={"type": "category"},
yaxis={"title": "Churn Rate"},
title='Partner',
plot_bgcolor = 'rgb(243,243,243)',
paper_bgcolor = 'rgb(243,243,243)',
)
fig = go.Figure(data=plot_data, layout=plot_layout)
pyoff.iplot(fig)
df_plot = df_data.groupby('PhoneService').Churn.mean().reset_index()
plot_data = [
go.Bar(
x=df_plot['PhoneService'],
y=df_plot['Churn'],
width = [0.5, 0.5],
marker=dict(
color=['green', 'blue'])
)
]
plot_layout = go.Layout(
xaxis={"type": "category"},
yaxis={"title": "Churn Rate"},
title='Phone Service',
plot_bgcolor = 'rgb(243,243,243)',
paper_bgcolor = 'rgb(243,243,243)',
)
fig = go.Figure(data=plot_data, layout=plot_layout)
pyoff.iplot(fig)
df_plot = df_data.groupby('MultipleLines').Churn.mean().reset_index()
plot_data = [
go.Bar(
x=df_plot['MultipleLines'],
y=df_plot['Churn'],
width = [0.5, 0.5, 0.5],
marker=dict(
color=['green', 'blue', 'orange'])
)
]
plot_layout = go.Layout(
xaxis={"type": "category"},
title='Multiple Lines',
yaxis={"title": "Churn Rate"},
plot_bgcolor = 'rgb(243,243,243)',
paper_bgcolor = 'rgb(243,243,243)',
)
fig = go.Figure(data=plot_data, layout=plot_layout)
pyoff.iplot(fig)
df_plot = df_data.groupby('InternetService').Churn.mean().reset_index()
plot_data = [
go.Bar(
x=df_plot['InternetService'],
y=df_plot['Churn'],
width = [0.5, 0.5, 0.5],
marker=dict(
color=['green', 'blue', 'orange'])
)
]
plot_layout = go.Layout(
xaxis={"type": "category"},
title='Internet Service',
yaxis={"title": "Churn Rate"},
plot_bgcolor = 'rgb(243,243,243)',
paper_bgcolor = 'rgb(243,243,243)',
)
fig = go.Figure(data=plot_data, layout=plot_layout)
pyoff.iplot(fig)
df_plot = df_data.groupby('OnlineSecurity').Churn.mean().reset_index()
plot_data = [
go.Bar(
x=df_plot['OnlineSecurity'],
y=df_plot['Churn'],
width = [0.5, 0.5, 0.5],
marker=dict(
color=['green', 'blue', 'orange'])
)
]
plot_layout = go.Layout(
xaxis={"type": "category"},
yaxis={"title": "Churn Rate"},
title='Online Security',
plot_bgcolor = "rgb(243,243,243)",
paper_bgcolor = "rgb(243,243,243)",
)
fig = go.Figure(data=plot_data, layout=plot_layout)
pyoff.iplot(fig)
df_plot = df_data.groupby('OnlineBackup').Churn.mean().reset_index()
plot_data = [
go.Bar(
x=df_plot['OnlineBackup'],
y=df_plot['Churn'],
width = [0.5, 0.5, 0.5],
marker=dict(
color=['green', 'blue', 'orange'])
)
]
plot_layout = go.Layout(
xaxis={"type": "category"},
title='Online Backup',
plot_bgcolor = "rgb(243,243,243)",
paper_bgcolor = "rgb(243,243,243)",
)
fig = go.Figure(data=plot_data, layout=plot_layout)
pyoff.iplot(fig)
df_plot = df_data.groupby('DeviceProtection').Churn.mean().reset_index()
plot_data = [
go.Bar(
x=df_plot['DeviceProtection'],
y=df_plot['Churn'],
width = [0.5, 0.5, 0.5],
marker=dict(
color=['green', 'blue', 'orange'])
)
]
plot_layout = go.Layout(
xaxis={"type": "category"},
title='Device Protection',
plot_bgcolor = "rgb(243,243,243)",
paper_bgcolor = "rgb(243,243,243)",
)
fig = go.Figure(data=plot_data, layout=plot_layout)
pyoff.iplot(fig)
df_plot = df_data.groupby('TechSupport').Churn.mean().reset_index()
plot_data = [
go.Bar(
x=df_plot['TechSupport'],
y=df_plot['Churn'],
width = [0.5, 0.5, 0.5],
marker=dict(
color=['green', 'blue', 'orange'])
)
]
plot_layout = go.Layout(
xaxis={"type": "category"},
title='Tech Support',
plot_bgcolor = "rgb(243,243,243)",
paper_bgcolor = "rgb(243,243,243)",
)
fig = go.Figure(data=plot_data, layout=plot_layout)
pyoff.iplot(fig)
df_plot = df_data.groupby('StreamingTV').Churn.mean().reset_index()
plot_data = [
go.Bar(
x=df_plot['StreamingTV'],
y=df_plot['Churn'],
width = [0.5, 0.5, 0.5],
marker=dict(
color=['green', 'blue', 'orange'])
)
]
plot_layout = go.Layout(
xaxis={"type": "category"},
title='Streaming TV',
plot_bgcolor = "rgb(243,243,243)",
paper_bgcolor = "rgb(243,243,243)",
)
fig = go.Figure(data=plot_data, layout=plot_layout)
pyoff.iplot(fig)
df_plot = df_data.groupby('StreamingMovies').Churn.mean().reset_index()
plot_data = [
go.Bar(
x=df_plot['StreamingMovies'],
y=df_plot['Churn'],
width = [0.5, 0.5, 0.5],
marker=dict(
color=['green', 'blue', 'orange'])
)
]
plot_layout = go.Layout(
xaxis={"type": "category"},
title='Streaming Movies',
plot_bgcolor = "rgb(243,243,243)",
paper_bgcolor = "rgb(243,243,243)",
)
fig = go.Figure(data=plot_data, layout=plot_layout)
pyoff.iplot(fig)
df_plot = df_data.groupby('Contract').Churn.mean().reset_index()
plot_data = [
go.Bar(
x=df_plot['Contract'],
y=df_plot['Churn'],
width = [0.5, 0.5, 0.5],
marker=dict(
color=['green', 'blue', 'orange'])
)
]
plot_layout = go.Layout(
xaxis={"type": "category"},
title='Contract',
plot_bgcolor = "rgb(243,243,243)",
paper_bgcolor = "rgb(243,243,243)",
)
fig = go.Figure(data=plot_data, layout=plot_layout)
pyoff.iplot(fig)
df_plot = df_data.groupby('PaperlessBilling').Churn.mean().reset_index()
plot_data = [
go.Bar(
x=df_plot['PaperlessBilling'],
y=df_plot['Churn'],
width = [0.5, 0.5],
marker=dict(
color=['green', 'blue'])
)
]
plot_layout = go.Layout(
xaxis={"type": "category"},
title='Paperless Billing',
plot_bgcolor = "rgb(243,243,243)",
paper_bgcolor = "rgb(243,243,243)",
)
fig = go.Figure(data=plot_data, layout=plot_layout)
pyoff.iplot(fig)
df_plot = df_data.groupby('PaymentMethod').Churn.mean().reset_index()
plot_data = [
go.Bar(
x=df_plot['PaymentMethod'],
y=df_plot['Churn'],
width = [0.5, 0.5, 0.5,0.5],
marker=dict(
color=['green', 'blue', 'orange','red'])
)
]
plot_layout = go.Layout(
xaxis={"type": "category"},
title='Payment Method',
plot_bgcolor = "rgb(243,243,243)",
paper_bgcolor = "rgb(243,243,243)",
)
fig = go.Figure(data=plot_data, layout=plot_layout)
pyoff.iplot(fig)
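The categorical cells above all repeat the same group-by step before plotting. As a hedged sketch, that step could be factored into one helper; `churn_rate_by` and the toy frame are hypothetical names introduced here, and the only assumption is that the target column is already encoded as 0/1.

```python
import pandas as pd


def churn_rate_by(df, column, target="Churn"):
    """Return the mean of a 0/1 target for each value of `column`."""
    return (
        df.groupby(column)[target]
        .mean()
        .reset_index()
        .rename(columns={target: "ChurnRate"})
    )


# Tiny invented frame standing in for df_data.
toy = pd.DataFrame({
    "Contract": ["Month-to-month", "Month-to-month", "One year", "Two year"],
    "PaperlessBilling": ["Yes", "No", "Yes", "No"],
    "Churn": [1, 0, 0, 0],
})

# One loop replaces a copy-pasted cell per column.
for col in ["Contract", "PaperlessBilling"]:
    print(churn_rate_by(toy, col))
```

The resulting frame has one row per category, so it can be passed to `go.Bar` the same way `df_plot` is used above.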
df_data.tenure.describe()
count    7043.000000
mean       32.371149
std        24.559481
min         0.000000
25%         9.000000
50%        29.000000
75%        55.000000
max        72.000000
Name: tenure, dtype: float64
df_plot = df_data.groupby('tenure').Churn.mean().reset_index()
plot_data = [
go.Scatter(
x=df_plot['tenure'],
y=df_plot['Churn'],
mode='markers',
name='Low',
marker= dict(size= 7,
line= dict(width=1),
color= 'blue',
opacity= 0.8
),
)
]
plot_layout = go.Layout(
yaxis= {'title': "Churn Rate"},
xaxis= {'title': "Tenure"},
title='Tenure based Churn rate',
plot_bgcolor = "rgb(243,243,243)",
paper_bgcolor = "rgb(243,243,243)",
)
fig = go.Figure(data=plot_data, layout=plot_layout)
pyoff.iplot(fig)
def order_cluster(cluster_field_name, target_field_name, df, ascending):
    new_cluster_field_name = 'new_' + cluster_field_name
    df_new = df.groupby(cluster_field_name)[target_field_name].mean().reset_index()
    df_new = df_new.sort_values(by=target_field_name, ascending=ascending).reset_index(drop=True)
    df_new['index'] = df_new.index
    df_final = pd.merge(df, df_new[[cluster_field_name, 'index']], on=cluster_field_name)
    df_final = df_final.drop([cluster_field_name], axis=1)
    df_final = df_final.rename(columns={"index": cluster_field_name})
    return df_final
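KMeans assigns cluster ids in no particular order, so `order_cluster` renumbers them by the mean of a target column. The following sketch repeats the same logic on a small invented frame to show the effect; the toy values and the starting labels are assumptions made for illustration.

```python
import pandas as pd


def order_cluster(cluster_field_name, target_field_name, df, ascending):
    # Mean of the target per cluster, sorted so that row position = new label
    df_new = df.groupby(cluster_field_name)[target_field_name].mean().reset_index()
    df_new = df_new.sort_values(by=target_field_name, ascending=ascending).reset_index(drop=True)
    df_new['index'] = df_new.index
    # Attach the new label to every row, then swap it in for the old one
    df_final = pd.merge(df, df_new[[cluster_field_name, 'index']], on=cluster_field_name)
    df_final = df_final.drop([cluster_field_name], axis=1)
    df_final = df_final.rename(columns={"index": cluster_field_name})
    return df_final


# Invented data: the labels 0/2/1 mimic the arbitrary ids KMeans might return.
toy = pd.DataFrame({
    "tenure": [60, 65, 2, 5, 30, 35],
    "TenureCluster": [0, 0, 2, 2, 1, 1],
})

ordered = order_cluster("TenureCluster", "tenure", toy, True)
print(ordered.sort_values("tenure"))
```

After the call, label 0 holds the shortest tenures and label 2 the longest, which is what makes the later `Low`/`Mid`/`High` mapping meaningful.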
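sse = {}
# Work on a copy so adding the label column does not raise SettingWithCopyWarning
df_cluster = df_data[['tenure']].copy()
for k in range(1, 10):
    # Fit on the tenure column only, so the labels from the previous k are not used as a feature
    kmeans = KMeans(n_clusters=k, max_iter=1000, n_init=10).fit(df_cluster[['tenure']])
    df_cluster["clusters"] = kmeans.labels_
    sse[k] = kmeans.inertia_  # within-cluster sum of squared errors
plt.figure()
plt.plot(list(sse.keys()), list(sse.values()))
plt.xlabel("Number of clusters")
plt.ylabel("SSE (inertia)")
plt.show()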
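The loop above is the elbow method: fit KMeans for several values of k and look for the point where the inertia stops dropping sharply. Here is a minimal sketch on invented one-dimensional data with three obvious groups. The data, the range of k, and `random_state=0` are assumptions chosen to keep the example small and repeatable.

```python
import numpy as np
from sklearn.cluster import KMeans

# Invented 1-D data with three well-separated groups (not the tenure column).
values = np.array([1, 2, 3, 30, 31, 32, 60, 61, 62], dtype=float).reshape(-1, 1)

sse = {}
for k in range(1, 6):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(values)
    sse[k] = model.inertia_  # sum of squared distances to the nearest centroid

# The drop from k=2 to k=3 is large; beyond k=3 the curve flattens out.
for k, value in sse.items():
    print(k, round(value, 2))
```

On real columns the bend is usually less sharp than in this toy case, so the choice of three clusters below is a judgment call rather than a hard rule.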
kmeans = KMeans(n_clusters=3, n_init=10)
kmeans.fit(df_data[['tenure']])
df_data['TenureCluster'] = kmeans.predict(df_data[['tenure']])
df_data = order_cluster('TenureCluster', 'tenure',df_data,True)
df_data.groupby('TenureCluster').tenure.describe()
| TenureCluster | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| 0 | 2878.0 | 7.512509 | 5.977337 | 0.0 | 2.0 | 6.0 | 12.0 | 20.0 |
| 1 | 1926.0 | 33.854102 | 8.208706 | 21.0 | 26.0 | 34.0 | 41.0 | 48.0 |
| 2 | 2239.0 | 63.048682 | 7.478229 | 49.0 | 56.0 | 64.0 | 70.0 | 72.0 |
df_data['TenureCluster'] = df_data["TenureCluster"].replace({0:'Low',1:'Mid',2:'High'})
df_plot = df_data.groupby('TenureCluster').Churn.mean().reset_index()
plot_data = [
go.Bar(
x=df_plot['TenureCluster'],
y=df_plot['Churn'],
width = [0.5, 0.5, 0.5],
marker=dict(
color=['green', 'blue', 'orange'])
)
]
plot_layout = go.Layout(
xaxis={"type": "category","categoryarray":['Low','Mid','High']},
title='Tenure Cluster vs Churn Rate',
plot_bgcolor = "rgb(243,243,243)",
paper_bgcolor = "rgb(243,243,243)",
)
fig = go.Figure(data=plot_data, layout=plot_layout)
pyoff.iplot(fig)
df_plot = df_data.copy()
df_plot['MonthlyCharges'] = df_plot['MonthlyCharges'].astype(int)
df_plot = df_plot.groupby('MonthlyCharges').Churn.mean().reset_index()
plot_data = [
go.Scatter(
x=df_plot['MonthlyCharges'],
y=df_plot['Churn'],
mode='markers',
name='Low',
marker= dict(size= 7,
line= dict(width=1),
color= 'blue',
opacity= 0.8
),
)
]
plot_layout = go.Layout(
yaxis= {'title': "Churn Rate"},
xaxis= {'title': "Monthly Charges"},
title='Monthly Charge vs Churn rate',
plot_bgcolor = "rgb(243,243,243)",
paper_bgcolor = "rgb(243,243,243)",
)
fig = go.Figure(data=plot_data, layout=plot_layout)
pyoff.iplot(fig)
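sse = {}
# Work on a copy so adding the label column does not raise SettingWithCopyWarning
df_cluster = df_data[['MonthlyCharges']].copy()
for k in range(1, 10):
    # Fit on the MonthlyCharges column only, so the labels from the previous k are not used as a feature
    kmeans = KMeans(n_clusters=k, max_iter=1000, n_init=10).fit(df_cluster[['MonthlyCharges']])
    df_cluster["clusters"] = kmeans.labels_
    sse[k] = kmeans.inertia_  # within-cluster sum of squared errors
plt.figure()
plt.plot(list(sse.keys()), list(sse.values()))
plt.xlabel("Number of clusters")
plt.ylabel("SSE (inertia)")
plt.show()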
kmeans = KMeans(n_clusters=3, n_init=10)
kmeans.fit(df_data[['MonthlyCharges']])
df_data['MonthlyChargeCluster'] = kmeans.predict(df_data[['MonthlyCharges']])
df_data = order_cluster('MonthlyChargeCluster', 'MonthlyCharges',df_data,True)
df_data.groupby('MonthlyChargeCluster').MonthlyCharges.describe()
| MonthlyChargeCluster | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| 0 | 1892.0 | 23.384619 | 5.660437 | 18.25 | 19.80 | 20.40 | 25.0500 | 42.40 |
| 1 | 2239.0 | 61.628808 | 10.441432 | 42.60 | 51.80 | 61.55 | 70.7000 | 77.80 |
| 2 | 2912.0 | 94.054258 | 10.343944 | 77.85 | 85.05 | 93.90 | 101.9125 | 118.75 |
df_data['MonthlyChargeCluster'] = df_data["MonthlyChargeCluster"].replace({0:'Low',1:'Mid',2:'High'})
df_plot = df_data.groupby('MonthlyChargeCluster').Churn.mean().reset_index()
plot_data = [
go.Bar(
x=df_plot['MonthlyChargeCluster'],
y=df_plot['Churn'],
width = [0.5, 0.5, 0.5],
marker=dict(
color=['green', 'blue', 'orange'])
)
]
plot_layout = go.Layout(
xaxis={"type": "category","categoryarray":['Low','Mid','High']},
title='Monthly Charge Cluster vs Churn Rate',
plot_bgcolor = "rgb(243,243,243)",
paper_bgcolor = "rgb(243,243,243)",
)
fig = go.Figure(data=plot_data, layout=plot_layout)
pyoff.iplot(fig)
df_data[pd.to_numeric(df_data['TotalCharges'], errors='coerce').isnull()]
| | customerID | gender | SeniorCitizen | Partner | Dependents | tenure | PhoneService | MultipleLines | InternetService | OnlineSecurity | ... | StreamingTV | StreamingMovies | Contract | PaperlessBilling | PaymentMethod | MonthlyCharges | TotalCharges | Churn | TenureCluster | MonthlyChargeCluster |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 91 | 3115-CZMZD | Male | 0 | No | Yes | 0 | Yes | No | No | No internet service | ... | No internet service | No internet service | Two year | No | Mailed check | 20.25 | | 0 | Low | Low |
| 136 | 4367-NUYAO | Male | 0 | Yes | Yes | 0 | Yes | Yes | No | No internet service | ... | No internet service | No internet service | Two year | No | Mailed check | 25.75 | | 0 | Low | Low |
| 416 | 7644-OMVMY | Male | 0 | Yes | Yes | 0 | Yes | No | No | No internet service | ... | No internet service | No internet service | Two year | No | Mailed check | 19.85 | | 0 | Low | Low |
| 478 | 3213-VVOLG | Male | 0 | Yes | Yes | 0 | Yes | Yes | No | No internet service | ... | No internet service | No internet service | Two year | No | Mailed check | 25.35 | | 0 | Low | Low |
| 556 | 2520-SGTTA | Female | 0 | Yes | Yes | 0 | Yes | No | No | No internet service | ... | No internet service | No internet service | Two year | No | Mailed check | 20.00 | | 0 | Low | Low |
| 668 | 2923-ARZLG | Male | 0 | Yes | Yes | 0 | Yes | No | No | No internet service | ... | No internet service | No internet service | One year | Yes | Mailed check | 19.70 | | 0 | Low | Low |
| 1976 | 4472-LVYGI | Female | 0 | Yes | Yes | 0 | No | No phone service | DSL | Yes | ... | Yes | No | Two year | Yes | Bank transfer (automatic) | 52.55 | | 0 | Low | Mid |
| 2114 | 1371-DWPAZ | Female | 0 | Yes | Yes | 0 | No | No phone service | DSL | Yes | ... | Yes | No | Two year | No | Credit card (automatic) | 56.05 | | 0 | Low | Mid |
| 2995 | 4075-WKNIU | Female | 0 | Yes | Yes | 0 | Yes | Yes | DSL | No | ... | Yes | No | Two year | No | Mailed check | 73.35 | | 0 | Low | Mid |
| 3008 | 2775-SEFEE | Male | 0 | No | Yes | 0 | Yes | Yes | DSL | Yes | ... | No | No | Two year | Yes | Bank transfer (automatic) | 61.90 | | 0 | Low | Mid |
| 4249 | 5709-LVOEQ | Female | 0 | Yes | Yes | 0 | Yes | No | DSL | Yes | ... | Yes | Yes | Two year | No | Mailed check | 80.85 | | 0 | Low | High |
11 rows × 23 columns
len(df_data[pd.to_numeric(df_data['TotalCharges'], errors='coerce').isnull()])
11
# Blank TotalCharges strings become NaN after coercion; drop those 11 rows
df_data['TotalCharges'] = pd.to_numeric(df_data['TotalCharges'], errors='coerce')
df_data = df_data.dropna(subset=['TotalCharges'])
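`TotalCharges` is loaded as an `object` column because a few rows hold a blank string instead of a number. The sketch below shows, on an invented three-row frame, how `pd.to_numeric` with `errors='coerce'` turns such values into `NaN` so they can be counted and dropped; the toy values are assumptions, not rows from the dataset.

```python
import pandas as pd

# Invented rows: the first one mimics a new customer with a blank TotalCharges.
toy = pd.DataFrame({
    "tenure": [0, 12, 3],
    "TotalCharges": [" ", "845.5", "59.85"],
})

# Values that cannot be parsed as numbers become NaN instead of raising an error.
toy["TotalCharges"] = pd.to_numeric(toy["TotalCharges"], errors="coerce")
n_missing = int(toy["TotalCharges"].isna().sum())

# Drop only the rows where the coerced column is missing.
clean = toy.dropna(subset=["TotalCharges"]).reset_index(drop=True)
print(n_missing, clean["TotalCharges"].dtype, len(clean))
```

Dropping is reasonable here because only 11 of 7043 rows are affected. Since all of them have a tenure of 0, filling with 0 would be another defensible choice.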
df_plot = df_data.copy()
df_plot['TotalCharges'] = df_plot['TotalCharges'].astype(int)
df_plot = df_plot.groupby('TotalCharges').Churn.mean().reset_index()
plot_data = [
go.Scatter(
x=df_plot['TotalCharges'],
y=df_plot['Churn'],
mode='markers',
name='Low',
marker= dict(size= 7,
line= dict(width=1),
color= 'blue',
opacity= 0.8
),
)
]
plot_layout = go.Layout(
yaxis= {'title': "Churn Rate"},
xaxis= {'title': "Total Charges"},
title='Total Charge vs Churn rate',
plot_bgcolor = "rgb(243,243,243)",
paper_bgcolor = "rgb(243,243,243)",
)
fig = go.Figure(data=plot_data, layout=plot_layout)
pyoff.iplot(fig)
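sse = {}
# Work on a copy so adding the label column does not raise SettingWithCopyWarning
df_cluster = df_data[['TotalCharges']].copy()
for k in range(1, 10):
    # Fit on the TotalCharges column only, so the labels from the previous k are not used as a feature
    kmeans = KMeans(n_clusters=k, max_iter=1000, n_init=10).fit(df_cluster[['TotalCharges']])
    df_cluster["clusters"] = kmeans.labels_
    sse[k] = kmeans.inertia_  # within-cluster sum of squared errors
plt.figure()
plt.plot(list(sse.keys()), list(sse.values()))
plt.xlabel("Number of clusters")
plt.ylabel("SSE (inertia)")
plt.show()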
# n_init is set explicitly (10 is the current default) to suppress the FutureWarning
kmeans = KMeans(n_clusters=3, n_init=10)
kmeans.fit(df_data[['TotalCharges']])
df_data['TotalChargeCluster'] = kmeans.predict(df_data[['TotalCharges']])
df_data = order_cluster('TotalChargeCluster', 'TotalCharges',df_data,True)
df_data.groupby('TotalChargeCluster').TotalCharges.describe()
| TotalChargeCluster | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| 0 | 4142.0 | 680.650398 | 567.014323 | 18.80 | 160.8125 | 531.85 | 1131.1125 | 1951.0 |
| 1 | 1611.0 | 3239.588206 | 808.969674 | 1952.25 | 2515.7500 | 3182.95 | 3945.7000 | 4740.0 |
| 2 | 1279.0 | 6268.911767 | 1013.644373 | 4741.45 | 5438.3250 | 6130.85 | 7030.9750 | 8684.8 |
df_data['TotalChargeCluster'] = df_data["TotalChargeCluster"].replace({0:'Low',1:'Mid',2:'High'})
df_plot = df_data.groupby('TotalChargeCluster').Churn.mean().reset_index()
plot_data = [
    go.Bar(
        x=df_plot['TotalChargeCluster'],
        y=df_plot['Churn'],
        width=[0.5, 0.5, 0.5],
        marker=dict(color=['green', 'blue', 'orange'])
    )
]
plot_layout = go.Layout(
    xaxis={"type": "category", "categoryarray": ['Low', 'Mid', 'High']},
    title='Total Charge Cluster vs Churn Rate',
    plot_bgcolor="rgb(243,243,243)",
    paper_bgcolor="rgb(243,243,243)",
)
fig = go.Figure(data=plot_data, layout=plot_layout)
pyoff.iplot(fig)
df_data.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 7032 entries, 0 to 7031
Data columns (total 51 columns):
 #   Column                                   Non-Null Count  Dtype
---  ------                                   --------------  -----
 0   customerID                               7032 non-null   object
 1   gender                                   7032 non-null   int32
 2   SeniorCitizen                            7032 non-null   int64
 3   Partner                                  7032 non-null   int32
 4   Dependents                               7032 non-null   int32
 5   tenure                                   7032 non-null   int64
 6   PhoneService                             7032 non-null   int32
 7   PaperlessBilling                         7032 non-null   int32
 8   MonthlyCharges                           7032 non-null   float64
 9   TotalCharges                             7032 non-null   float64
 10  Churn                                    7032 non-null   int32
 11  MultipleLines_No                         7032 non-null   uint8
 12  MultipleLines_No_phone_service           7032 non-null   uint8
 13  MultipleLines_Yes                        7032 non-null   uint8
 14  InternetService_DSL                      7032 non-null   uint8
 15  InternetService_Fiber_optic              7032 non-null   uint8
 16  InternetService_No                       7032 non-null   uint8
 17  OnlineSecurity_No                        7032 non-null   uint8
 18  OnlineSecurity_No_internet_service       7032 non-null   uint8
 19  OnlineSecurity_Yes                       7032 non-null   uint8
 20  OnlineBackup_No                          7032 non-null   uint8
 21  OnlineBackup_No_internet_service         7032 non-null   uint8
 22  OnlineBackup_Yes                         7032 non-null   uint8
 23  DeviceProtection_No                      7032 non-null   uint8
 24  DeviceProtection_No_internet_service     7032 non-null   uint8
 25  DeviceProtection_Yes                     7032 non-null   uint8
 26  TechSupport_No                           7032 non-null   uint8
 27  TechSupport_No_internet_service          7032 non-null   uint8
 28  TechSupport_Yes                          7032 non-null   uint8
 29  StreamingTV_No                           7032 non-null   uint8
 30  StreamingTV_No_internet_service          7032 non-null   uint8
 31  StreamingTV_Yes                          7032 non-null   uint8
 32  StreamingMovies_No                       7032 non-null   uint8
 33  StreamingMovies_No_internet_service      7032 non-null   uint8
 34  StreamingMovies_Yes                      7032 non-null   uint8
 35  Contract_Month_to_month                  7032 non-null   uint8
 36  Contract_One_year                        7032 non-null   uint8
 37  Contract_Two_year                        7032 non-null   uint8
 38  PaymentMethod_Bank_transfer__automatic_  7032 non-null   uint8
 39  PaymentMethod_Credit_card__automatic_    7032 non-null   uint8
 40  PaymentMethod_Electronic_check           7032 non-null   uint8
 41  PaymentMethod_Mailed_check               7032 non-null   uint8
 42  TenureCluster_High                       7032 non-null   uint8
 43  TenureCluster_Low                        7032 non-null   uint8
 44  TenureCluster_Mid                        7032 non-null   uint8
 45  MonthlyChargeCluster_High                7032 non-null   uint8
 46  MonthlyChargeCluster_Low                 7032 non-null   uint8
 47  MonthlyChargeCluster_Mid                 7032 non-null   uint8
 48  TotalChargeCluster_High                  7032 non-null   uint8
 49  TotalChargeCluster_Low                   7032 non-null   uint8
 50  TotalChargeCluster_Mid                   7032 non-null   uint8
dtypes: float64(2), int32(6), int64(2), object(1), uint8(40)
memory usage: 769.1+ KB
# import LabelEncoder
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
dummy_columns = []  # categorical columns with more than two distinct values
for column in df_data.columns:
    if df_data[column].dtype == object and column != 'customerID':
        if df_data[column].nunique() == 2:
            # apply LabelEncoder to binary columns
            df_data[column] = le.fit_transform(df_data[column])
        else:
            dummy_columns.append(column)
# apply get_dummies to the remaining categorical columns
df_data = pd.get_dummies(data=df_data, columns=dummy_columns)
df_data[['gender','Partner','TenureCluster_High','TenureCluster_Low','TenureCluster_Mid']].head()
| | gender | Partner | TenureCluster_High | TenureCluster_Low | TenureCluster_Mid |
|---|---|---|---|---|---|
| 0 | 0 | 1 | 0 | 1 | 0 |
| 1 | 0 | 0 | 0 | 1 | 0 |
| 2 | 1 | 0 | 0 | 1 | 0 |
| 3 | 1 | 0 | 0 | 1 | 0 |
| 4 | 1 | 1 | 0 | 1 | 0 |
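To make the encoding step concrete, here is a small sketch on a toy frame. The frame `df_toy` and its values are hypothetical and stand in for `df_data`. Binary columns go through `LabelEncoder` (classes are sorted, so `No` becomes 0 and `Yes` becomes 1), columns with more levels go through `pd.get_dummies`, and the resulting names are then cleaned so they can be used in a formula string.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy frame with hypothetical values, standing in for df_data.
df_toy = pd.DataFrame({
    "customerID": ["a1", "b2", "c3", "d4"],
    "Partner": ["Yes", "No", "No", "Yes"],
    "Contract": ["Month-to-month", "One year", "Two year", "One year"],
})

le = LabelEncoder()
dummy_columns = []
for column in ["Partner", "Contract"]:
    if df_toy[column].nunique() == 2:
        # Two levels: map them to 0/1 in sorted order.
        df_toy[column] = le.fit_transform(df_toy[column])
    else:
        # More than two levels: expand into indicator columns later.
        dummy_columns.append(column)

df_toy = pd.get_dummies(df_toy, columns=dummy_columns)

# Replace characters that are awkward in formula strings.
df_toy.columns = [c.replace(" ", "_").replace("-", "_") for c in df_toy.columns]
print(df_toy.columns.tolist())
```

Note that `get_dummies` keeps one indicator per level by default, so the indicators of a given feature always sum to 1. That is convenient for tree models, but it makes the columns linearly dependent in a regression with an intercept.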
# make column names formula-safe: no spaces, parentheses or hyphens
all_columns = []
for column in df_data.columns:
    column = column.replace(" ", "_").replace("(", "_").replace(")", "_").replace("-", "_")
    all_columns.append(column)
df_data.columns = all_columns
# build the right-hand side of the GLM formula from all feature columns
glm_columns = 'gender'
for column in df_data.columns:
    if column not in ['Churn', 'customerID', 'gender']:
        glm_columns = glm_columns + ' + ' + column
import statsmodels.api as sm
import statsmodels.formula.api as smf
glm_model = smf.glm(formula='Churn ~ {}'.format(glm_columns), data=df_data, family=sm.families.Binomial())
res = glm_model.fit()
print(res.summary())
Generalized Linear Model Regression Results
==============================================================================
Dep. Variable: Churn No. Observations: 7032
Model: GLM Df Residuals: 7002
Model Family: Binomial Df Model: 29
Link Function: Logit Scale: 1.0000
Method: IRLS Log-Likelihood: -2900.5
Date: Sat, 05 Aug 2023 Deviance: 5801.1
Time: 22:57:42 Pearson chi2: 7.55e+03
No. Iterations: 100 Pseudo R-squ. (CS): 0.2833
Covariance Type: nonrobust
===========================================================================================================
coef std err z P>|z| [0.025 0.975]
-----------------------------------------------------------------------------------------------------------
Intercept 0.2384 0.276 0.863 0.388 -0.303 0.780
gender -0.0248 0.065 -0.382 0.702 -0.152 0.103
SeniorCitizen 0.2246 0.085 2.649 0.008 0.058 0.391
Partner 0.0015 0.078 0.019 0.985 -0.152 0.155
Dependents -0.1349 0.090 -1.498 0.134 -0.311 0.042
tenure -0.0624 0.008 -7.394 0.000 -0.079 -0.046
PhoneService 0.2124 0.403 0.527 0.598 -0.577 1.002
PaperlessBilling 0.3493 0.075 4.668 0.000 0.203 0.496
MonthlyCharges -0.0329 0.032 -1.032 0.302 -0.095 0.030
TotalCharges 0.0001 9.97e-05 1.190 0.234 -7.68e-05 0.000
MultipleLines_No -0.1205 0.130 -0.928 0.353 -0.375 0.134
MultipleLines_No_phone_service 0.0261 0.160 0.162 0.871 -0.288 0.340
MultipleLines_Yes 0.3329 0.283 1.175 0.240 -0.222 0.888
InternetService_DSL -0.5935 0.226 -2.625 0.009 -1.037 -0.150
InternetService_Fiber_optic 1.0259 0.578 1.776 0.076 -0.106 2.158
InternetService_No -0.1939 0.091 -2.124 0.034 -0.373 -0.015
OnlineSecurity_No 0.3211 0.108 2.970 0.003 0.109 0.533
OnlineSecurity_No_internet_service -0.1939 0.091 -2.124 0.034 -0.373 -0.015
OnlineSecurity_Yes 0.1112 0.261 0.426 0.670 -0.400 0.623
OnlineBackup_No 0.2160 0.107 2.023 0.043 0.007 0.425
OnlineBackup_No_internet_service -0.1939 0.091 -2.124 0.034 -0.373 -0.015
OnlineBackup_Yes 0.2164 0.261 0.830 0.406 -0.294 0.727
DeviceProtection_No 0.1407 0.107 1.312 0.190 -0.070 0.351
DeviceProtection_No_internet_service -0.1939 0.091 -2.124 0.034 -0.373 -0.015
DeviceProtection_Yes 0.2916 0.261 1.119 0.263 -0.219 0.802
TechSupport_No 0.3079 0.108 2.854 0.004 0.096 0.519
TechSupport_No_internet_service -0.1939 0.091 -2.124 0.034 -0.373 -0.015
TechSupport_Yes 0.1245 0.262 0.476 0.634 -0.389 0.638
StreamingTV_No -0.0600 0.048 -1.246 0.213 -0.154 0.034
StreamingTV_No_internet_service -0.1939 0.091 -2.124 0.034 -0.373 -0.015
StreamingTV_Yes 0.4924 0.340 1.450 0.147 -0.173 1.158
StreamingMovies_No -0.0594 0.049 -1.222 0.222 -0.155 0.036
StreamingMovies_No_internet_service -0.1939 0.091 -2.124 0.034 -0.373 -0.015
StreamingMovies_Yes 0.4918 0.340 1.448 0.148 -0.174 1.157
Contract_Month_to_month 0.7650 0.118 6.493 0.000 0.534 0.996
Contract_One_year 0.0886 0.121 0.732 0.464 -0.148 0.326
Contract_Two_year -0.6152 0.149 -4.135 0.000 -0.907 -0.324
PaymentMethod_Bank_transfer__automatic_ 0.0267 0.097 0.275 0.783 -0.164 0.217
PaymentMethod_Credit_card__automatic_ -0.0573 0.099 -0.580 0.562 -0.251 0.136
PaymentMethod_Electronic_check 0.3188 0.087 3.684 0.000 0.149 0.488
PaymentMethod_Mailed_check -0.0498 0.097 -0.513 0.608 -0.240 0.140
TenureCluster_High 0.5091 0.188 2.702 0.007 0.140 0.879
TenureCluster_Low -0.1138 0.171 -0.666 0.505 -0.449 0.221
TenureCluster_Mid -0.1569 0.119 -1.322 0.186 -0.390 0.076
MonthlyChargeCluster_High 0.0632 0.169 0.374 0.709 -0.268 0.394
MonthlyChargeCluster_Low 0.1008 0.195 0.517 0.605 -0.281 0.483
MonthlyChargeCluster_Mid 0.0745 0.127 0.587 0.558 -0.174 0.323
TotalChargeCluster_High 0.3793 0.204 1.859 0.063 -0.021 0.779
TotalChargeCluster_Low -0.2939 0.177 -1.658 0.097 -0.641 0.053
TotalChargeCluster_Mid 0.1530 0.122 1.255 0.209 -0.086 0.392
===========================================================================================================
np.exp(res.params)
Intercept                                  1.269263
gender                                     0.975459
SeniorCitizen                              1.251805
Partner                                    1.001508
Dependents                                 0.873792
tenure                                     0.939490
PhoneService                               1.236620
PaperlessBilling                           1.418124
MonthlyCharges                             0.967648
TotalCharges                               1.000119
MultipleLines_No                           0.886492
MultipleLines_No_phone_service             1.026397
MultipleLines_Yes                          1.394959
InternetService_DSL                        0.552392
InternetService_Fiber_optic                2.789481
InternetService_No                         0.823722
OnlineSecurity_No                          1.378712
OnlineSecurity_No_internet_service         0.823722
OnlineSecurity_Yes                         1.117628
OnlineBackup_No                            1.241102
OnlineBackup_No_internet_service           0.823722
OnlineBackup_Yes                           1.241548
DeviceProtection_No                        1.151126
DeviceProtection_No_internet_service       0.823722
DeviceProtection_Yes                       1.338592
TechSupport_No                             1.360530
TechSupport_No_internet_service            0.823722
TechSupport_Yes                            1.132565
StreamingTV_No                             0.941771
StreamingTV_No_internet_service            0.823722
StreamingTV_Yes                            1.636159
StreamingMovies_No                         0.942335
StreamingMovies_No_internet_service        0.823722
StreamingMovies_Yes                        1.635180
Contract_Month_to_month                    2.149005
Contract_One_year                          1.092624
Contract_Two_year                          0.540559
PaymentMethod_Bank_transfer__automatic_    1.027099
PaymentMethod_Credit_card__automatic_      0.944350
PaymentMethod_Electronic_check             1.375474
PaymentMethod_Mailed_check                 0.951379
TenureCluster_High                         1.663869
TenureCluster_Low                          0.892438
TenureCluster_Mid                          0.854780
MonthlyChargeCluster_High                  1.065234
MonthlyChargeCluster_Low                   1.106017
MonthlyChargeCluster_Mid                   1.077320
TotalChargeCluster_High                    1.461283
TotalChargeCluster_Low                     0.745356
TotalChargeCluster_Mid                     1.165342
dtype: float64
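Exponentiating a logistic-regression coefficient turns it from a change in log-odds into an odds ratio. The sketch below repeats that step for three coefficients copied from the summary above and expresses each as a percentage change in the odds of churn. It uses `math.exp` instead of `np.exp` only to stay dependency-free.

```python
import math

# Coefficients copied from the GLM summary (log-odds scale).
coefs = {
    "Contract_Month_to_month": 0.7650,
    "Contract_Two_year": -0.6152,
    "tenure": -0.0624,
}

# exp(beta) is the multiplicative change in the odds for a one-unit increase.
odds_ratios = {name: math.exp(beta) for name, beta in coefs.items()}

for name, ratio in odds_ratios.items():
    pct_change = (ratio - 1) * 100
    print(f"{name}: odds ratio {ratio:.3f} ({pct_change:+.1f}% odds of churn)")
```

Read this way, a month-to-month contract roughly doubles the odds of churn, while each extra month of tenure lowers them by about 6%, holding the other variables fixed. These individual values should be treated with some caution: the model includes every dummy level of each feature, the `_No_internet_service` columns all share the same coefficient, and the fit stopped at 100 iterations, which suggests the coefficients are not uniquely identified.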
#create feature set and labels
X = df_data.drop(['Churn','customerID'],axis=1)
y = df_data.Churn
#train and test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.05, random_state=56)
#building the model
xgb_model = xgb.XGBClassifier(max_depth=5, learning_rate=0.08, objective= 'binary:logistic',n_jobs=-1).fit(X_train, y_train)
print('Accuracy of XGB classifier on training set: {:.2f}'
.format(xgb_model.score(X_train, y_train)))
print('Accuracy of XGB classifier on test set: {:.2f}'
.format(xgb_model.score(X_test[X_train.columns], y_test)))
Accuracy of XGB classifier on training set: 0.84
Accuracy of XGB classifier on test set: 0.80
y_pred = xgb_model.predict(X_test)
print(classification_report(y_test, y_pred))
              precision    recall  f1-score   support

           0       0.82      0.91      0.86       246
           1       0.71      0.54      0.61       106

    accuracy                           0.80       352
   macro avg       0.77      0.72      0.74       352
weighted avg       0.79      0.80      0.79       352
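Two quick checks help put the report in context. The sketch below recomputes the F1 score of the churn class from its precision and recall, and compares the 0.80 test accuracy with the accuracy of a trivial model that always predicts "no churn". All numbers are read from the report above; the variable names are only illustrative.

```python
# Values read from the classification report.
precision_1, recall_1 = 0.71, 0.54   # churn class (label 1)
support_0, support_1 = 246, 106      # test rows per class

# F1 is the harmonic mean of precision and recall.
f1_1 = 2 * precision_1 * recall_1 / (precision_1 + recall_1)

# Accuracy of always predicting the majority class (no churn).
baseline_accuracy = support_0 / (support_0 + support_1)

# Approximate number of churners the model actually catches.
caught = recall_1 * support_1

print(f"F1 (churn class): {f1_1:.2f}")
print(f"Majority-class baseline accuracy: {baseline_accuracy:.2f}")
print(f"Churners caught: about {caught:.0f} of {support_1}")
```

The baseline comes out near 0.70, so the gain in accuracy is about ten points. With a recall of 0.54, roughly 57 of the 106 churners in the test set are flagged, which is arguably the more relevant number for a retention campaign.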
from xgboost import plot_tree
import graphviz
# use a large figure so the plotted tree stays legible
fig, ax = plt.subplots(figsize=(100,100))
plot_tree(xgb_model, ax=ax)
<Axes: >
1/(1+np.exp(-0.032))  # logistic (sigmoid) transform: log-odds -> probability
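The expression above is the logistic (sigmoid) function, which maps a log-odds score to a probability. Presumably 0.032 is a leaf value read from the plotted tree, although that is an assumption. A small sketch of the conversion:

```python
import math

def sigmoid(log_odds: float) -> float:
    """Map a log-odds score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-log_odds))

leaf_value = 0.032  # the value used in the expression above
probability = sigmoid(leaf_value)
print(round(probability, 3))  # → 0.508
```

With the `binary:logistic` objective, a single leaf is only one tree's contribution. The model sums the leaf values of all trees (plus a base score) and applies the sigmoid to that total, so one leaf on its own does not give the final predicted probability.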
from xgboost import plot_importance
fig, ax = plt.subplots(figsize=(10,8))
plot_importance(xgb_model, ax=ax)
<Axes: title={'center': 'Feature importance'}, xlabel='F score', ylabel='Features'>
df_data['proba'] = xgb_model.predict_proba(df_data[X_train.columns])[:,1]
df_data[['customerID', 'proba']].head()
| | customerID | proba |
|---|---|---|
| 0 | 7590-VHVEG | 0.563349 |
| 1 | 6713-OKOMC | 0.111432 |
| 2 | 7469-LKBCI | 0.014819 |
| 3 | 8779-QRDMV | 0.863318 |
| 4 | 1680-VDCWW | 0.043804 |
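The predicted probabilities can then be used to pick the "high risk" customers for a retention effort. The sketch below applies this to the five rows shown above. The 0.5 cutoff is a hypothetical choice, not something the notebook prescribes. In practice it would depend on the cost of an offer versus the value of a retained customer.

```python
import pandas as pd

# The five rows displayed above.
scores = pd.DataFrame({
    "customerID": ["7590-VHVEG", "6713-OKOMC", "7469-LKBCI", "8779-QRDMV", "1680-VDCWW"],
    "proba": [0.563349, 0.111432, 0.014819, 0.863318, 0.043804],
})

threshold = 0.5  # hypothetical cutoff for "high risk"

# Keep customers at or above the cutoff, highest risk first.
high_risk = (
    scores[scores["proba"] >= threshold]
    .sort_values("proba", ascending=False)
    .reset_index(drop=True)
)
print(high_risk)
```

Because `proba` is computed on the full `df_data`, most of these scores are for customers the model was trained on, so they are likely to look somewhat more confident than scores for genuinely new customers would.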